[Linkpost] “Eliciting secret knowledge from language models” by Arthur Conmy, Bartosz Cywiński, Sam Marks
Description
TL;DR: We study secret elicitation: discovering knowledge that AI has but doesn’t explicitly verbalize. To that end, we fine-tune LLMs to have specific knowledge they can apply downstream, but deny having when asked directly. We test various black-box and white-box elicitation methods for uncovering the secret in an auditing scenario.
See our X thread and full paper for details.
Training and auditing a model with secret knowledge. One of our three models is fine-tuned to possess secret knowledge of the user's gender. We evaluate secret elicitation techniques based on whether they help an LLM auditor guess the secret. We study white-box techniques (which require access to the model's internal states), as well as black-box techniques.Summary
- We fine-tune secret-keeping LLMs in three settings to know: (1) a secret word, (2) a secret instruction, and (3) the user's gender. Models are trained to apply this secret [...]
---
Outline:
(01:05 ) Summary
(02:24 ) Introduction
---
First published:
October 2nd, 2025
Source:
https://www.lesswrong.com/posts/Mv3yg7wMXfns3NPaz/eliciting-secret-knowledge-from-language-models-1
Linkpost URL:
https://arxiv.org/abs/2510.01070
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.